SUMMARY

The numerical solution of the important problem of optimally
processing noisy observations to obtain optimal estimates of an
underlying stochastic process, or signal process, is considered. The
associated mathematical problem consists of the solution of a
parabolic partial differential equation driven by the stochastic
observations. The solution of this equation is the conditional density
of the signal given the observations. The purpose of this paper is to
examine the impact of modern parallel and pipeline machines, such as
the CDC Star-100, Illiac IV, Cray 1, and AP-120B Array Processor, on
the solution time and on the algorithmic program structure necessary
to use the capabilities of each machine effectively. The mathematical
problem suffers from the "curse of dimensionality"; for example, when
the signal process has state dimension two, the partial differential
equation has two spatial dimensions and taxes the capabilities of
third-generation serial machines. This is because our problem involves
hundreds of solutions of the partial differential equation for
different observation sequences, in order to determine error
performance. With the advent of parallel and pipeline machines,
problems of higher state dimension become feasible.

This paper will detail the mathematical statement of our problem and
its general nature and importance in view of its relationship with
partial differential equations. We will review the special
architectural features of the various machines with emphasis on those
which we can exploit for our problem. Next, we will describe the
software algorithms developed for each machine and describe how the
machine capabilities and limitations influenced the structure of the
algorithm for each machine.

The speed of the various machines will then be compared for our
problem. Finally, we will describe the machine structure which we feel
would be most effective for our problem both in terms of speed and
also compatibility with natural algorithms for problems of higher
space dimension.

1. INTRODUCTION

We will be interested in solving the following partial differential
equation:

$$dp = A p\,dt + (h - \hat{h})' R^{-1} (dz - \hat{h}\,dt)\, p \qquad (1.0)$$

with $p: R^n \times R^+ \to R^+$, $h: R^n \to R^m$, and $z$ a
stochastic process taking values in $R^m$. The function $\hat{h}$ is
the integral of $h$ with respect to the conditional probability
density $p$. In general, $A$ can be any second-order parabolic
operator. Physically, $A$ is the adjoint of the infinitesimal
generator of the Markov diffusion vector process which represents the
signal process, $x$. The observation process $z$ is given by
$dz = h(x)\,dt + dv$ with $v$ white noise with spectral matrix $R$,
while $p$ is the conditional probability density of the signal at time
$t$ given the observations before time $t$. Details and background on
how (1.0) arises can be found in [1]. In particular, (1.0) must be
interpreted via the stochastic integral of Ito [2].

For numerical purposes it is convenient to consider a discrete version
of the problem, where we replace the continuous observations by a
discrete sequence sampled from the continuous process at a rate high
enough to ensure that the discrete problem is close to the continuous
one. In this case (1.0) becomes

$$P_{n+1} = S * F_n \qquad (1.1)$$

$$F_{n+1} = K_{n+1}\, D_{n+1}\, P_{n+1} \qquad (1.2)$$

The first equation, the convolution of $S$ with $F_n$, is the density
of the measure of the signal process $\Delta$ seconds later if the
signal process had measure density $F_n$ at time zero, with $\Delta$
being the sampling interval. So $P_{n+1}$ is the solution of (1.0) at
time $\Delta$ if at time zero $p = F_n$ and $R$ is infinite. Now (1.2)
is approximately the solution of (1.0) at time $(n+1)\Delta$ if at
time $n\Delta$ $p = P_{n+1}$ and $A = 0$. In terms of the discrete
problem, $P_n$ ($F_n$) represents the conditional density of $x_n$
given $z_{n-1}, \ldots, z_0$ ($z_n, z_{n-1}, \ldots, z_0$),
respectively. Of course, $K_{n+1}$ is chosen so that $F_{n+1}$ has
total mass one. We note that the relationship of (1.0) to (1.1) and
(1.2) is analogous to the Trotter formula giving the semigroup with
infinitesimal generator $A+B$ in terms of long products of alternate
applications of the semigroup of $A$ and that of $B$, i.e., formally,

$$\exp((A+B)n) = \exp(A)\exp(B) \cdots (n \text{ times}) \cdots \exp(A)\exp(B).$$

The representation (1.1) and (1.2) in fact has a desirable property
not shared by direct differencing techniques applied to (1.0), namely
that solutions from non-negative initial conditions are non-negative.
This property is of paramount importance for our problem, as we seek a
probability density solution of (1.0).
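The non-negativity property can be illustrated with a minimal
one-dimensional sketch of one convolution step followed by one
multiplicative update and normalization. The grid size, the
Gaussian-shaped kernel `S`, and the positive weight `D` are illustrative
assumptions, not the model used in the timing studies, and the function
name `filter_step` is hypothetical.

```python
import numpy as np

def filter_step(F, S, D):
    """One iteration: P = S * F (cyclic convolution), then F_new = K * D * P."""
    n = len(F)
    # Cyclic convolution written as an explicit sum of shifted copies:
    # every term is a product of non-negative numbers.
    P = np.zeros(n)
    for k in range(n):
        P += S[k] * np.roll(F, k)
    G = D * P                 # pointwise multiplicative update
    K = 1.0 / G.sum()         # K is chosen so the result has total mass one
    return K * G

n = 32
x = np.arange(n)
dist = np.minimum(x, n - x)                  # cyclic distance from grid point 0
S = np.exp(-0.5 * (dist / 2.0) ** 2)         # assumed kernel, non-negative
S /= S.sum()
F0 = np.zeros(n)
F0[5] = 1.0                                  # point mass as initial density
D = 1.0 + 0.5 * np.cos(2 * np.pi * x / n)    # assumed strictly positive weight
F1 = filter_step(F0, S, D)
```

Because each step only multiplies and adds non-negative quantities, no
negative density values can appear, which is not guaranteed for a
direct finite-difference discretization.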

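For the timing studies reported in this paper, a particular model was
chosen:

$$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad h(x) = \begin{pmatrix} \cdots \\ \sin x_1 \end{pmatrix}, \qquad R = \begin{pmatrix} \cdots \\ 0 \;\; \cdots \end{pmatrix},$$

$$A = \frac{q}{2}\,\frac{\partial^2}{\partial x_2^2} + x_2\,\frac{\partial}{\partial x_1}.$$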

Further details concerning this problem can be found in [3] and [4].
We chose to examine the performance of the digital computers
considered here in the context of this problem because of the wide
experience we have had with this problem on a wide variety of
machines. Further, because we used a machine-independent (i.e.,
word-length-independent) random number generator, see [5], we could
directly compare computed numerical values on every machine. This was
extremely useful for software development purposes. This random number
generator produces statistically more reliable random numbers than
those generators available as part of supplied scientific subroutine
packages.


2. MACHINE ARCHITECTURE

The most common approach to speeding up the computations in serial
machines involves the use of pipelines, or multistage processing.
Pipelining involves the use of segmented functional units, with
registers between segments, so that many identical functions can be
overlapped in time at the maximum clocking rate, which is determined
by the speed of the logic. Pipelining began to appear in many
third-generation machine architectures in the instruction
fetch-decode-execute cycles. Recently, the pipelining concept has been
extended to include arithmetic functions. The number of stages in an
arithmetic pipeline is determined by the basic speed of the arithmetic
function unit in relation to the memory minor-cycle time (the interval
between successive fetches to memory). The combination of memory
paging and pipelined arithmetic units has come to be known as a
"vector processor." This terminology reflects the fact that optimal
machine efficiency is obtained by streaming long vectors of operands
through a single segmented arithmetic unit. The first implementation
of this concept on a large scale was introduced in the CDC Star.

Although pipelining permits speed-up of serial operations to a maximum
by the use of partial overlap, the results are not truly available in
parallel. The array processor, on the other hand, contains a large
number of identical functional units which may be actuated in a
lock-step fashion to produce computations in parallel from different
sets of data. Arrays may involve primitive functional units, such as
in array multipliers, or complete processing elements, such as those
employed in the Illiac-IV.

A. Parallel and Array Processors

Of the machines we consider, the Illiac-IV and the AP-120B Array
Processor are examples of differing parallel philosophies. The Illiac
consists of 64 processing elements (P.E.'s) which act synchronously;
each P.E. is in fact a C.P.U. with 2K words of memory. Each P.E. is
theoretically capable of arithmetic speeds similar to a CDC 6600.
Using Illiac at full capacity involves making every P.E. do something
useful all the time. In full overlap mode, only recently achieved, see
[3], Illiac can be doing memory fetches, for example, while the P.E.'s
are operating. Illiac has a disk memory of 24 million 64-bit words,
which must be exchanged with local memory that is limited to 2K words
for each P.E.

The AP-120B Array Processor has an entirely different design
philosophy. It is also a synchronous machine, but the elements that
operate in parallel have distinct functions: they are table memory,
TM; data memory, MD; data pad X, DPX; data pad Y, DPY; S pad, SP;
floating point adder, FA; floating point multiplier, FM; and data bus,
DB. The multiplier is a 3-place pipeline, the adder is a 2-place
pipeline, while the memories are a 3-place pipeline and a 2-place
pipeline for the data and table memory, respectively.

The machine is synchronous with a 167-nanosecond cycle time. Data
memory reads and writes can only be called every other cycle and are
finished in 3 cycles, while table memory reads and writes are
accomplished in 2 cycles. The X and Y data pads are each 32 words, and
in one cycle reads from the last position of the various pipes and
writes to the other elements can be accomplished. Programs in the
assembly language of the box, in order to be efficient, should attempt
to keep all the elements operating on each cycle. The S pad does
integer arithmetic to set memory read and write addresses, to do index
computations, and to set memory pointers for DPX, DPY, SP and MD. The
SP consists of 16 registers. Data words are 32-bit words. Effectively
the box is capable of 12 million floating point operations per second.

The major deficiency of the box is the data memory access time, which
often tends to be the limiting factor in loop speed. Software
development is time-consuming, although there are programs for the
AP-120B. The AP-120B is interfaced to a mini-computer and, because of
the speed differential between the box and the host mini (30-40 times
faster in the case of the PDP-11/55, for example), if speed is the
object, very little computation can be done in the host. Reference [7]
contains more details.

B. Vector Machines

The CDC Star-100 and the Cray 1 are examples of vector processors with
64-bit words. In the case of the Star-100, speed is obtained by
streaming. A vector is located in consecutive memory locations and
memory pulls out the consecutive memory locations into a pipeline in
sequential order. For example, in adding V1 to V2 the outputs of the
two memory pipelines are inserted into the add pipeline. Note that the
add pipe and the memory pipes operate concurrently with the write pipe
part of the time, so that memory fetches and writes are not costly.
The execution time for finding the vector sum is 100 + n/2 minor
cycles, where n is the vector dimension and the minor cycle time is 40
nanoseconds. The 100 cycles is set-up time. Immediately two
limitations on achieving maximum speed are apparent: first, n must be
large compared to 100; and secondly, the components of the vector must
be located in consecutive memory locations. The requirement of
obtaining vector operands from sequential memory locations generates a
significant data selection and reorganization problem. Data
reorganization alone accounted for 86 percent of all identifiable
overhead, which in turn represented 36 percent of the running time for
our problem. Although a FORTRAN-based language is available for the
Star, some of the more powerful instructions for merging and
compressing vectors were only available by utilizing special in-line
subroutine-like linkages to the assembly language.
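The effect of the set-up time can be made concrete with a short
calculation based on the 100 + n/2 cycle formula and the 40-nanosecond
minor cycle quoted above. The function names are hypothetical, and the
sketch models only a single vector add, not the full program.

```python
MINOR_CYCLE_NS = 40.0   # Star-100 minor cycle time, from the text
SETUP_CYCLES = 100.0    # set-up time in minor cycles, from the text

def vector_add_time_ns(n):
    """Time for a vector add of length n: (100 + n/2) minor cycles."""
    return (SETUP_CYCLES + n / 2.0) * MINOR_CYCLE_NS

def adds_per_microsecond(n):
    """Effective rate: n additions divided by the elapsed time."""
    return n / (vector_add_time_ns(n) / 1000.0)

# Asymptotic rate as n grows: 2 results per 40 ns cycle.
peak = 2.0 * 1000.0 / MINOR_CYCLE_NS

for n in (10, 100, 200, 1000, 4096):
    print(n, round(adds_per_microsecond(n), 2), "additions/microsecond; peak", peak)
```

Under this model a vector of length 200 reaches only half of the
asymptotic rate, which is why short vectors are penalized so heavily.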

The Cray 1 has a faster minor cycle time (12.5 nanoseconds) and the
memory pipeline is connected to 64-word vector registers, which are in
turn connected to the add pipeline, which outputs to another vector
register. If the vector length is more than 64, its elements are
processed by reading blocks of 64 into the eight vector registers.
This architecture is responsible for a nominal set-up time much lower
than that of the Star, so that the long vector requirement is
ameliorated for the Cray 1.
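The block-of-64 processing described above can be sketched as a loop
over slices. This is only an illustration of the blocking pattern in
array notation; the function name is hypothetical and nothing here
reflects actual Cray 1 instructions or compiler output.

```python
import numpy as np

VECTOR_REGISTER_LENGTH = 64  # words per vector register, from the text

def blocked_add(a, b):
    """Add two vectors block by block, 64 elements at a time."""
    out = np.empty_like(a)
    for start in range(0, len(a), VECTOR_REGISTER_LENGTH):
        stop = min(start + VECTOR_REGISTER_LENGTH, len(a))
        # Each slice stands in for one load into a vector register.
        out[start:stop] = a[start:stop] + b[start:stop]
    return out

a = np.arange(200, dtype=float)
b = np.ones(200)
c = blocked_add(a, b)   # three full blocks of 64 and one block of 8
```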

Speed is obtained by overlapping the memory-to-vector-register reads,
the writes, and the functional unit operation pipelines, as well as by
synchronous operation of different function units when they have
different input streams.

3. ALGORITHMS FOR PARALLEL NONLINEAR FILTERS

The algorithms for all the machines can be viewed in simple parts. Let
us consider explicit forms of (1.1) and (1.2) as

$$P(n+1, x_i, y_j) = \sum_{k=1}^{N} S(y_j - y_k)\, F(n, [x_i - \Delta y_k], y_k) \qquad (3.1)$$

where $x_i$, $i = 1, \ldots, M$, are subdivisions of $(-\pi, \pi)$,
$y_j$, $j = 1, \ldots, N$, are subdivisions of $(-\pi/4, \pi/4)$, and
where

$$F(n, [x_i - \Delta y_k], y_k) = a\, F(n, x_H, y_k) + (1 - a)\, F(n, x_L, y_k),$$

$a$ is the interpolation constant, and $x_L$ ($x_H$) is the nearest
grid point below (above)
$[x_i - \Delta y_k] = (x_i - \Delta y_k) \bmod 2\pi - \pi$. The
equation (1.1) takes the form (3.1) for the two-dimensional phase
demodulation problem, where the densities have been approximated by
point masses of height $P(n+1, x_i, y_j)$ at point $(x_i, y_j)$.
Numerically solving (3.1) consists of three major parts:

Rearrangement: $F(n, x_i, y_j) \to F(n, x_L, y_j)$   (3.2)

Interpolation: $F(n, x_L, y_k) \to F(n, [x_i - \Delta y_k], y_k)$   (3.3)

Convolution: $F(n, [x_i - \Delta y_k], y_k) \to P(n+1, x_i, y_j)$   (3.4)

Then, in order to complete one iteration, the explicit form of (1.2),
namely

$$F(n+1, x_i, y_j) = K(n+1)\, D(n+1, x_i)\, P(n+1, x_i, y_j) \qquad (3.5)$$

must be solved. This operation leads to two other parts:

Data Update: $P(n+1, x_i, y_j) \to D(n+1, x_i)\, P(n+1, x_i, y_j)$   (3.6)

Normalization: $D(n+1, x_i)\, P(n+1, x_i, y_j) \to F(n+1, x_i, y_j)$   (3.7)

Explicit expressions for $S$ and $D$ can be found in [4], but they are
not particularly useful here except for the fact that they are
periodic of period $2\pi/4$ and $2\pi$ respectively, and $F$ and $P$
are also periodic in their first and second arguments with periods
$2\pi$ and $2\pi/4$ respectively. It is convenient to represent the
matrix $a_{ij}$ by a vector $v$ of $(M+1)N$ components such that:

$$v(i + (M+1)(j-1)) = \begin{cases} a_{ij}, & i \le M \\ a_{1j}, & i = M+1 \end{cases} \qquad (3.8)$$

If $a_{ij} = F(n, x_i, y_j)$, then there is a permutation matrix $P$,
so that

$$s = P v \qquad (3.9)$$

and $s$ corresponds to $F(n, x_L, y_j)$ via (3.8), while the shift of
$s$, $k$, where $k(i) = s(i+1)$, corresponds to $F(n, x_H, y_j)$. Now
(3.9) accomplishes the rearrangement task. Interpolation can be
accomplished by the Star vector multiply, wherein each component of
the product is the product of the corresponding components, i.e.,

$$I = s + w * (k - s) \qquad (3.10)$$

with $w$ a vector of weights.
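The rearrangement and interpolation steps can be sketched in array form
as a gather of the lower neighbors, a gather of the upper neighbors, and
one elementwise multiply-add. This is a sketch under assumptions: the
two-dimensional array layout (one row per y value), the grid spacing
`dx`, and all names below are illustrative and do not reproduce the
(M+1)N vector layout or the permutation matrix of the Star program.

```python
import numpy as np

def shift_interpolate(F, y, delta, dx):
    """Return I[j, i] ~ F(x_i - delta*y_j, y_j) by periodic linear interpolation.

    F has shape (N, M): row j holds the x-grid values for y[j].
    """
    N, M = F.shape
    i = np.arange(M)
    I = np.empty_like(F)
    for j in range(N):
        pos = i - delta * y[j] / dx          # target position in grid units
        lo = np.floor(pos).astype(int)       # grid index just below (x_L)
        w = pos - lo                         # weight given to the point above
        s = F[j, lo % M]                     # rearrangement: gather at x_L
        k = F[j, (lo + 1) % M]               # shifted gather: values at x_H
        I[j] = s + w * (k - s)               # elementwise interpolation
    return I

M, N = 8, 3
dx = 2 * np.pi / M
y = np.array([0.0, dx, 0.5 * dx])            # assumed values, easy to check
F = np.random.default_rng(0).random((N, M))
I = shift_interpolate(F, y, delta=1.0, dx=dx)
```

With these test values the first row is unchanged, the second is
shifted by exactly one grid point, and the third is the average of two
neighboring grid values.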

To prepare for the convolution, the interpolated vector $I$, reduced
by removal of every $(M+1)$st element, is periodically expanded as

$$E(i + M(j-1)) = I1(i + ((j - 1 + N/2) \bmod N)\, M)$$

for $i = 1, \ldots, M$, $j = 1, \ldots, 2N$, $N$ even, with $I1$ being
the vector $I$ after surgery, i.e., with every $(M+1)$st element (the
last element of each row in the matrix representation) removed. This
expansion eliminates the need for modular arithmetic in the evaluation
of (3.1) when $S$ is symmetric and has support contained within $N$
grid points in $y$.

On Star the convolution can be accomplished by the following sum:

$$S(0)\,I1(0) + S(1)\,(I1(-1) + I1(+1)) + \cdots + S(N/2)\,(I1(N/2) + I1(-N/2)) \qquad (3.11)$$

with $S(i) = S(y_i)$ and $I1(k)$ being a subvector of $E$ of dimension
$NM$ consisting of contiguous elements, the first element being the
$(1 + M(N/2 + k))$th element of $E$. Note that every element of $I1$
is called from memory $N+1$ times in forming (3.11). This method is
not time-consuming on Star because of the effective overlap in time of
fetches from consecutive memory locations with the computation.
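The expansion and the sum of shifted contiguous subvectors can be
sketched as follows, using 0-based indexing. The function names and the
random test data are hypothetical; the point is only that every operand
of the sum is a contiguous slice, so no modular index arithmetic is
needed inside the loop.

```python
import numpy as np

def expand(I1, N, M):
    """Periodic expansion: block j of E is block (j + N/2) mod N of I1."""
    blocks = I1.reshape(N, M)
    order = (np.arange(2 * N) + N // 2) % N
    return blocks[order].reshape(-1)

def subvector(E, k, N, M):
    """Contiguous slice of length N*M starting at block N/2 + k (0-based)."""
    start = M * (N // 2 + k)
    return E[start:start + N * M]

def star_convolution(I1, S, N, M):
    """Sum of shifted subvectors; S holds S[0..N/2] of a symmetric kernel."""
    E = expand(I1, N, M)
    out = S[0] * subvector(E, 0, N, M)
    for k in range(1, N // 2 + 1):
        out = out + S[k] * (subvector(E, k, N, M) + subvector(E, -k, N, M))
    return out

N, M = 8, 4
rng = np.random.default_rng(1)
I1 = rng.random(N * M)          # interpolated vector, N blocks of length M
S = rng.random(N // 2 + 1)      # assumed kernel samples S(0), ..., S(N/2)
P = star_convolution(I1, S, N, M)
```

Each full-length vector operation touches all N*M elements, which is the
long-vector, consecutive-memory pattern that suits a streaming machine.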


For both the Illiac and the AP-120B the convolution is performed by
noting that for each fixed $i$, after $F(n, \cdot)$ is rearranged,
(3.1) represents a one-dimensional convolution in $k$. So $M$
one-dimensional convolutions are performed. Further, each element of
the interpolated vector $I1$ is called from memory only once, each of
$S(i)$, $i = 1, \ldots, N/2$, is multiplied by it, and each product is
accumulated to produce the convolution. This method of convolution was
the basis for development of the Illiac-IV algorithm. In that case the
$N = 128$ elements of each row of $F(n, \cdot)$ are cyclically
convolved with $S(\cdot)$ to produce a row of $P(n+1, x, \cdot)$. An
adaptation of the same method for the AP-120B was suggested to us by
Randy Cole of the U.S.C. Information Science Institute; the software
for the AP-120B was developed by Jack Mallinkrodt of Communications
Research.
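The fetch-once scheme for one row can be sketched as a scatter into
accumulators: each input element is read a single time and its products
with the kernel samples are added to the output positions they affect.
The function name and test data are hypothetical, and the sketch is
serial; it does not model the distribution of work over processing
elements.

```python
import numpy as np

def cyclic_convolve_fetch_once(row, S_half):
    """Cyclic convolution of one row with a symmetric kernel.

    S_half[d] is the kernel sample at offset d, for d = 0..N/2. Each input
    element is read once and its products are scattered into accumulators.
    """
    N = len(row)
    acc = np.zeros(N)
    for k_in in range(N):
        value = row[k_in]                 # single fetch of this element
        acc[k_in] += S_half[0] * value
        for d in range(1, len(S_half)):
            acc[(k_in + d) % N] += S_half[d] * value
            acc[(k_in - d) % N] += S_half[d] * value
    return acc

rng = np.random.default_rng(2)
row = rng.random(8)
S_half = rng.random(5)                    # assumed samples for offsets 0..4
result = cyclic_convolve_fetch_once(row, S_half)
```

Applying this function to each of the M rows in turn gives the M
one-dimensional convolutions described above.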

One should note that our Star program represents the matrix as a
vector of constructed columns. If one wished to do a convolution as
above for Star, one would have to either read nonconsecutive memory
locations or rearrange the interpolated vector to represent
concatenated rows. In the former case, the memory pipeline would be
inefficient because of the calls to non-contiguous memory locations,
and in the latter case the vector rearrangement would produce an
equivalent time penalty. This situation reveals an example of the
major architectural drawback of Star, namely, algorithms which use
both column and row operations are not effective for Star. Note that
(3.6) is also a row-oriented operation and as such will be
accomplished differently on Star than on the other machines.

On the Star, (3.5) is accomplished by constructing a vector
corresponding to $D(n+1, \cdot)$ with the first through $M$th
components the values of $D$ and continued periodically, i.e.,

$$d(i + M(j-1)) = D(n+1, x_i) \qquad (3.12)$$

$i = 1, \ldots, M$, $j = 1, \ldots, N$. Now (3.6) is accomplished by a
vector multiply of $d$ by the vector resulting from (3.11).
Normalization is accomplished by dividing the vector corresponding to
(3.6) into two vectors of the same dimension, performing a vector add,
and repeating this until a single number remains; this yields the
number $K(n+1)$. This process is effective if $NM = 2^L$; if this is
not the case and a vector of odd dimension arises at any stage of the
process, a zero is adjoined, increasing the dimension by 1, and the
process is continued.
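The fold-and-add summation, including the zero padding for odd lengths,
can be sketched as below. The function name is hypothetical; the sketch
computes the total mass, whose reciprocal is the normalizing constant
under the convention that the normalized density sums to one.

```python
import numpy as np

def halving_sum(v):
    """Sum the components of v by repeated fold-and-add."""
    v = np.asarray(v, dtype=float)
    while len(v) > 1:
        if len(v) % 2 == 1:
            v = np.append(v, 0.0)      # adjoin a zero when the length is odd
        half = len(v) // 2
        v = v[:half] + v[half:]        # one vector add on the two halves
    return v[0]

total = halving_sum(np.arange(1, 11))  # lengths visited: 10, 5 -> 6, 3 -> 4, 2, 1
K = 1.0 / total                        # normalizing constant for this example
```

When the length is a power of two no padding is ever needed, and the sum
takes exactly log2 of the length vector adds.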

In the case of the Cray, each of the row vectors which make up
$P(n+1, \cdot)$ is multiplied by the scalar $D(n+1, x_i)$ for the
appropriate $i$. The AP-120B does the appropriate scalar version.

Because of the structure of Illiac with 64 independent P.E.'s,
$N = 128$ and $M = 32$ were the grid sizes chosen for all problems in
order not to penalize Illiac. The convolution and other row-oriented
operations were performed in two parts with all P.E.'s enabled. The
rearrangement and interpolation were also used for the Illiac
software.

4. EXPERIMENTAL RESULTS

A. Language Optimization

The code for the Star-100 was initially developed in Star FORTRAN;
then each major piece of the program was timed with hardware timers,
and those pieces of code which were major contributors to the overall
time were examined in assembly language. In particular, the
rearrangement was originally accomplished with an indexed vector
transfer instruction (vx to v), which had two vector arguments, the
first corresponding to $F(n, x_i, y_j)$ and the second a vector of
integers JNS. The value of vx to v was the vector t corresponding to
$F(n, x_L, y_j)$, i.e., t(i) was obtained from F indexed by JNS(i).
This instruction was found to be quite slow, and an assembly language
modification employing a more powerful version of vx to v, an indexed
block transfer (of length M, for example), was used. Similarly, a
vector sum instruction (sum), which has as domain a vector and as
range the sum of its components, was replaced by code accomplishing
normalization. Vector descriptors were also employed in the program.
These optimization efforts produced code which executed at 16.6
megaflops (millions of floating point operations per second) or, in
other words, about 36 percent of the time was devoted to overhead
operations: rearranging, expanding, etc. In particular, Star ran five
times faster than the 7600 on this problem, using the serial version
of the Star software with the OPT=2 FTN level 410 compiler. Further,
the performance of Star was exactly predictable from individual
instruction times.
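The difference between the element-wise indexed transfer and the
indexed block transfer can be sketched in array notation. This is only
an analogy: the function names, the block starts, and the data are
hypothetical, and the sketch says nothing about the actual Star
instruction formats or timings.

```python
import numpy as np

def gather_elements(F, JNS):
    """Element-wise indexed transfer: t[i] = F[JNS[i]]."""
    return F[JNS]

def gather_blocks(F, block_starts, M):
    """Indexed block transfer: copy M contiguous elements per index."""
    return np.concatenate([F[s:s + M] for s in block_starts])

M = 4
F = np.arange(12) * 10.0
block_starts = [8, 0, 4]                       # one index per block of length M
JNS = np.concatenate([np.arange(s, s + M) for s in block_starts])

t_elements = gather_elements(F, JNS)           # 12 indices consulted
t_blocks = gather_blocks(F, block_starts, M)   # 3 indices consulted
```

Both calls produce the same rearranged vector, but the block form needs
one index per M elements and moves contiguous runs of data.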

The Illiac-IV was coded in GLYPNIR and assembly listings were used to
speed up the program. Our program, running in full overlap mode, where
the array control unit (CU) can be operated simultaneously with
arithmetic P.E. operations, achieved 9 megaflops. However, the Illiac
is currently operating with an 80-nanosecond minor cycle time, rather
than the design goal of 50 nanoseconds. Further, the actual time
performance of our software suffers from inefficient language
optimization, resulting from using the GLYPNIR language. It is
expected that use of some assembly language in-line code or a
higher-order language more sophisticated than GLYPNIR would produce
slightly better running time. We will report more on this subject at a
later date.

The AP-120B Array Processor was programmed in assembly language and
the code seems perhaps within 10 percent of the best possible code.
Since the machine is synchronous, the code can be run on a digital
simulator and theoretically timed, with the result that it operated at
3.58 megaflops and executed arithmetic operations about 30% of the
time. In the near future, we will have hardware benchmarks available.

So far, we are attempting to benchmark the Cray 1 in the following
way: First, a very fast serial FORTRAN program, developed on the basis
of the Star code, will be used with the Cray 1 compiler to produce
code. We have been assured that the Cray compiler will vectorize
scalar code. Secondly, another FORTRAN program, which does the
row-oriented convolution, will be tested. Finally, an assembly
language version of our program will be prepared for the Cray 1. We
hope to report on some of these timing studies at the meeting.
Potentially, the Cray could be 3 to 10 times faster than Star. On the
solution of a linear system of equations the Cray 1 achieved 140
megaflops; see [10], and see Table 1 for direct machine comparisons.
Finally, a 3-dimensional, combined amplitude-phase demodulator has
been run on Star, with the density represented by a 25K-dimensional
vector (see [3]). We hope to run the Cray 1 on the same problem and to
compare running time.
TABLE 1

PERFORMANCE OF VECTOR PROCESSORS ON
THE PHASE MODULATION PROBLEM

Megaflops/   Machine     Megaflops   Megaflops   Time per
Dollar                   Theor.      Actual      Iteration
-----------------------------------------------------------
-            Illiac 4    64          9           9 msec.
3-5          Cray 1      80          -           -
2-4          Star-100    50          16.6        4.9 msec.
16-32        AP-120B     12          3.57        22.7 msec.
-            6600        2           .630        130 msec.
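As a consistency check on Table 1, the measured rate multiplied by the
time per iteration should give roughly the same operation count on
every machine, since all machines solve the same problem. The sketch
below carries out that arithmetic and also forms the ratio of actual to
theoretical rate; the dictionary layout and function name are
hypothetical, and the numbers are copied from the table.

```python
# (theoretical megaflops, actual megaflops, milliseconds per iteration)
table = {
    "Illiac 4": (64.0, 9.0, 9.0),
    "Star-100": (50.0, 16.6, 4.9),
    "AP-120B": (12.0, 3.57, 22.7),
    "6600": (2.0, 0.630, 130.0),
}

def operations_per_iteration(actual_megaflops, msec):
    """Megaflops times milliseconds gives thousands of operations."""
    return actual_megaflops * msec

summary = {
    name: (operations_per_iteration(actual, msec), actual / theor)
    for name, (theor, actual, msec) in table.items()
}

for name, (kilo_ops, efficiency) in summary.items():
    print(name, round(kilo_ops, 1), "thousand operations;",
          round(100 * efficiency), "percent of theoretical rate")
```

All four rows with complete data give between 81 and 82 thousand
operations per iteration, so the actual rates and times in the table
are mutually consistent.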